Could the Google Search index serve as a reference or indicator for media workers during the COVID-19 pandemic?

library(ggplot2)
library(dplyr)
library(plotly)
library(tidyverse)
library(gridExtra)
library(kableExtra)
library(readr)
library(readxl)
library(tibble)
library(janitor)
library(reshape2)
library(ggthemes)
library(hablar)
library(factoextra)
library(pheatmap)
library(latticeExtra)
library(knitr)
library(htmltools)
library(DT)

# Read Data From Percent_COVID_Index.csv
covid_data = read_csv("Percent_COVID_Index.csv")
covid_data = covid_data %>% convert(num(wfh_rate, socialdistance_rate, medicaltreatment_rate))

Executive Summary

The project aims to use the Google Search index to extract useful information and knowledge for media workers in the context of the COVID-19 pandemic. Two datasets were used for research and analysis: (a) the COVID-19 dataset from Our World in Data, and (b) the Google Search index for 13 search terms. The four most relevant search terms were found by calculating the correlation coefficient between the 13 search terms and new cases; the differences in search-term rankings among groups of countries (obtained by clustering), and the reasons behind them, were then analysed. In addition, the search index for the most relevant search term, “covid”, together with new cases, deaths, tests, and positive rates, was used to fit a Vector Auto-Regression (VAR) model to predict the trend and number of new cases in each country in the following week. The forecast results, popular search terms, and relevant policies of the countries studied can be viewed through our Shiny app. In summary, our Shiny app could be a useful tool for media workers and publishers, and all the results produced from the Google Search index could also benefit them.

Background

Motivation and Aim

On March 11, 2020, the World Health Organisation expressed concern about the speed and consequences of the spread of COVID-19 and declared it a pandemic (1). COVID-19 has now been spreading around the world for two years and has had a negative impact on people's lives everywhere. According to a 2020 report by Haroon and Rizvi (2), content generated by the news media can cause great panic and can even affect the volatility of stocks. In addition, research shows that regardless of the severity of an infectious disease, people tend to think that the more news there is about it, the worse it is (3). It is therefore an unshirkable obligation for the media to disseminate accurate information to the public, which is also conducive to blocking the spread of the epidemic (4). The purpose of our project is thus to use the Google Search index to extract useful information for media workers, helping them publish relevant news in different countries according to what people in those countries are concerned about, and providing reference information such as predictions of new COVID-19 cases, local policies, and important local events.

Data

Data Collection

The data cover 25 countries over the period 26 January 2020 to 14 April 2022. The 25 countries were selected based on a ranking of the most powerful countries given on this website (5). In addition, the development of the local internet and the availability of information about COVID-19 in each country were taken into account during selection.

The following data were used for research and analysis: (a) the number of COVID-19 new cases, new deaths, new tests, the positive rate, and population from the COVID-19 dataset; (b) the Google Search index for 13 search terms. The search index represents the search interest of a particular term relative to the highest point on the chart for the given region and time. A higher search index indicates a higher number of searches for that term, and thus that more people are interested in or concerned about it.

Data Pre-processing

The variables extracted from the owid-covid dataset are updated daily. To align them with the weekly search index data, weekly values of these variables were calculated. Moreover, the ratio of new cases to population in each country was calculated and named “new case percentage”. To compare the relationship between the 13 search terms and new cases intuitively, the growth rate of new cases and the growth rates of the 13 search indexes were calculated as follows:

\[\frac{New\ Case_{\ this\ week} - New\ Case_{\ last \ week}}{New\ Case_{\ last\ week}}\]

\[\frac{Search\ Index_{\ this\ week} - Search\ Index_{\ last\ week}}{Search\ Index_{\ last\ week}}\]

Python was used during the data cleaning process to merge the data needed for the project from 339 datasets (13 × 26 + 1: the 13 search-term datasets for each of the 25 countries and the world, plus the COVID-19 dataset) into one. (There are seven Python files; their introduction and usage are documented in the README.md file, GitHub link: see Appendix.) Finally, the merged dataset, “Percent_COVID_Index.csv”, was used as the dataset for the project.
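As a sketch, the weekly growth rate defined above can be computed in base R (the numbers below are toy values for illustration, not from the project dataset):

```r
# Weekly growth rate: (this week - last week) / last week
# The first week has no predecessor, so its growth rate is NA
growth_rate <- function(x) c(NA, diff(x) / head(x, -1))

new_cases_weekly <- c(100, 150, 120, 180)
growth_rate(new_cases_weekly)  # NA 0.5 -0.2 0.5
```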

Rolling Window

To explore the relationship between the 13 search terms and COVID-19 new cases in each country and globally, the Pearson correlation coefficient between new_case_rate and the growth rate of each of the 13 terms was calculated in each country using a rolling window of 90 days (13 weeks). The average correlation coefficient of each search term in each country was then calculated, and the search terms in each country were ranked by this average. The top search term in every country was “covid”. Because the correlation between “covid” and new cases does not reveal what aspect of COVID-19 people are most interested in, the fourth-ranked term was also added to the study. In this way, the three search terms most relevant to new cases in each country were found. Globally, “covid”, “lockdown”, “mask”, and “WFH” (Work From Home) are the four terms most relevant to new cases, so the following investigation focuses on these four.
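The rolling-window correlation can be sketched compactly in base R as follows (a simplified stand-in for the project loop below, with a 13-week window):

```r
# Pearson correlation over a sliding window of `width` consecutive observations
roll_corr <- function(x, y, width = 13) {
  n <- length(x) - width + 1
  sapply(seq_len(n), function(i) {
    w <- i:(i + width - 1)
    cor(x[w], y[w], method = "pearson")
  })
}

# Two perfectly linearly related series give correlation 1 in every window
roll_corr(1:30, 2 * (1:30) + 5)
```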

# Set Countries name
Countries <- c('Australia', 'Belgium', 'Brazil', 'Canada', 'China HongKong', 'France', 'Germany', 'India', 'Israel', 'Italy', 'Japan', 'Netherlands', 'Qatar', 'Russia', 'Saudi Arabia', 'Singapore', 'South Korea', 'Spain', 'Switzerland', 'Thailand', 'Turkey', 'United Arab Emirates', 'United Kingdom', 'United States', 'Vietnam', 'World')
# Set Columns name (Variable name)
rates_2 <- c("covid_search_rate", "education_rate", "flight_rate", "export_rate", "immigration_rate", "lockdown_rate", "marriage_rate", "mask_rate", "medicaltreatment_rate", "socialdistance_rate", "travel_rate", "vaccine_rate", "wfh_rate")

## 90 days = 13 weeks, so set the window size to 13
## For each country, calculate the Pearson correlation between each search term and new_cases_rate
windowSize = 13
## Used to store all correlation coefficients
corr_all <- c()
## Traverse the search term
for (rate_name in rates_2){
  ## Store country name
  countries_all <- c()
  ## Store date
  date_all <- c()
  ## Used to store the sum of new_cases in each window
  new_case_all <- c()
  for (name in Countries){
    each_country_data = covid_data[covid_data$CountryName==name, ]
    Date_index = each_country_data$Date
    ind = seq_len(length(Date_index) - windowSize)
    df <- data.frame(X = each_country_data[[rate_name]],
                   Y = each_country_data$new_cases_rate,
                   Case = each_country_data$new_cases,
                   Date = Date_index)
    # Used to store correlation coefficients for each word
    PearsonStat = rep(NA, length(ind))
    for (i in 1:length(ind)) {
      date_all <- append(date_all, df$Date[ind[i]])
      X_subset <- df$X[df$Date >= df$Date[ind[i]] & df$Date < df$Date[ind[i] + windowSize]]
      Y_subset <- df$Y[df$Date >= df$Date[ind[i]] & df$Date < df$Date[ind[i] + windowSize]]
      new_case_week <- sum(df$Case[df$Date >= df$Date[ind[i]] & df$Date < df$Date[ind[i] + windowSize]])
      new_case_all <- append(new_case_all, new_case_week)
      PearsonStat[i] <- cor(X_subset, Y_subset, method = "pearson")
      countries_all <- append(countries_all, name)
    }
    corr_all <- append(corr_all, PearsonStat)
  }
}
m1 <- matrix(corr_all, ncol = 13, byrow = FALSE)

## Create the dataframe to store all correlation coefficients in all countries
d1 <- as.data.frame(m1)
corr_colnames <- c("covid_search_corr", "education_corr", "flight_corr", "export_corr", "immigration_corr", "lockdown_corr", "marriage_corr", "mask_corr", "medicaltreatment_corr", "socialdistance_corr", "travel_corr", "vaccine_corr", "wfh_corr")
colnames(d1) <- corr_colnames
d1["countries_all"] <- countries_all
d1["date_all"] <- date_all
d1["new_case_all"] <- new_case_all
df <- d1[c("countries_all", "date_all", "new_case_all", "covid_search_corr", "education_corr", "flight_corr", "export_corr", "immigration_corr", "lockdown_corr", "marriage_corr", "mask_corr", "medicaltreatment_corr", "socialdistance_corr", "travel_corr", "vaccine_corr", "wfh_corr")]


## Calculate the average correlation coefficient of each hot word, and store them into dataframe
total_avg <-c()
for (corr_name in corr_colnames){
  avg <- c()
  for (name in Countries) {
      each_country_data = df[df$countries_all==name,]
      corr = each_country_data[[corr_name]]
      corr_avg = mean(corr)
      avg <- append(avg, corr_avg)
  }
  total_avg <- append(total_avg, avg)
}
avg_matrix <- matrix(total_avg, ncol = 13, byrow = FALSE)

avg_df <- as.data.frame(avg_matrix)
avg_colnames <- c("avg_covid_search_corr", "avg_education_corr", "avg_flight_corr", "avg_export_corr", "avg_immigration_corr", "avg_lockdown_corr", "avg_marriage_corr", "avg_mask_corr", "avg_medicaltreatment_corr", "avg_socialdistance_corr", "avg_travel_corr", "avg_vaccine_corr", "avg_wfh_corr")
colnames(avg_df) <- avg_colnames
avg_df["countries_all"] <- Countries
avg_df <- avg_df[c("countries_all", "avg_covid_search_corr", "avg_education_corr", "avg_flight_corr", "avg_export_corr", "avg_immigration_corr", "avg_lockdown_corr", "avg_marriage_corr", "avg_mask_corr", "avg_medicaltreatment_corr", "avg_socialdistance_corr", "avg_travel_corr", "avg_vaccine_corr", "avg_wfh_corr")]


## Calculate the rank of each word in each country and globally based on the average correlation coefficient of each word
i = 1
matrix_rank <- matrix(ncol = 13, nrow = 26)
country_name <- c()
for (name in Countries) {
  country_data <-avg_df[avg_df$countries_all==name, 2:14]
  country_name <- append(country_name, name)
  matrix_rank[i,] <- rank(-country_data)
  i = i+1
}

## Store the results into dataframe
df_rank <- data.frame(matrix_rank)
rank_colnames <- c("covid", "education", "flight", "export", "immigration", "lockdown", "marriage", "mask", "medicaltreatment", "socialdistance", "travel", "vaccine", "wfh")
colnames(df_rank) <- rank_colnames
df_rank["CountryName"] <- country_name
df_rank <- df_rank[c("CountryName", "covid", "education", "flight", "export", "immigration", "lockdown", "marriage", "mask", "medicaltreatment", "socialdistance", "travel", "vaccine", "wfh")]

Methods

Cross-Correlation Function

A further investigation of the time series of the “covid” search index and the time series of new cases was conducted. A time interval between the search index and new cases was observed in most countries: the search index is always ahead of new cases. This delay is reasonable, as when people develop symptoms they usually google first and only later get tested. Cross-correlation is often used to determine the shift between two time-based signals; the offset, or lag time, between two signals can be found from the maximum of the cross-correlation (6). The sample cross-correlation function (CCF), available in R as ccf(), extends this idea to time series and can be used to investigate the relationship between two series. Hence, the CCF was used to further validate the existence of a lag between the search index and new cases in each country. The result is shown in Figure 1. In general, new cases lag the search index: 23 of 25 countries demonstrate this trend, the exceptions being Japan and India, and 6 of 25 countries have a lag greater than 4 weeks. The remaining countries all show a reasonable lag, as expected. The existence of this lag in each country led us to build a prediction model for the trend and number of new cases using the Google search index of “covid” as one of the predictors.
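The lag detection can be sketched with base R's ccf() on synthetic data in which the search index leads new cases by exactly 2 weeks (the construction and variable names are illustrative, not the project data):

```r
set.seed(1)
sig <- rnorm(202)
search_index <- sig[3:202]   # leads by 2 weeks
new_cases    <- sig[1:200]   # the same signal, 2 weeks later

# ccf(x, y) estimates cor(x[t + k], y[t]); a peak at negative k means x leads y
cc <- ccf(search_index, new_cases, lag.max = 8, plot = FALSE)
best_lag <- cc$lag[which.max(cc$acf)]
best_lag  # -2: the search index leads new cases by 2 weeks
```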

# Construct a dataframe to outline the lag time in each country
# (lag_title and lag_time come from the CCF analysis, not shown here)
lag_time_df <- data.frame(Countries, lag_title, lag_time) 
colnames(lag_time_df) <- c('Country', "Lag", "Offset")


# Summaries all the information into a table
DT::datatable(lag_time_df, caption = HTML('Figure 1: Offset Table <br/> Note: This table displays the lag time (offset) between Search Index and new cases in each country. Column lag indicates which time series lag'
))

Vector Auto-regression Model

The search index was identified as a potential predictor of new COVID-19 cases, since it follows a similar trend. However, it alone is not sufficient to predict the number of new cases, which can be affected by many factors. Thus, additional time series such as new deaths, new tests, and the positive rate were also included as predictors. Vector Auto-Regression (VAR) is often used to analyse the relationships among multiple time series and to forecast them, so a VAR model was used to build the prediction model. VAR needs a substantial amount of data to train; otherwise, R returns null estimates for some coefficients. A training window of 100 weeks was found to be a good cut-off: it provides sufficient data to train the model while leaving enough test data for evaluation, although 100 observations per country is still a small sample. Among the selection criteria that can be used when building a VAR model, two are recommended for small samples: Akaike's information criterion (AIC) and the final prediction error (FPE) (7). We therefore trained the VAR model on 100 weeks of data using the FPE criterion.
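The core idea of a VAR, i.e. each series regressed on lags of every series, can be sketched in base R with a toy two-variable VAR(1); the project itself uses the vars package with the FPE criterion, so this is only an illustration of the mechanism:

```r
# Simulate a stationary two-variable VAR(1): each series depends on one lag
# of both series plus noise (toy coefficients, not estimated from project data)
set.seed(7)
n <- 120
x <- numeric(n); y <- numeric(n)
for (t in 2:n) {
  x[t] <- 0.5 * x[t - 1] + 0.2 * y[t - 1] + rnorm(1, sd = 0.1)
  y[t] <- 0.1 * x[t - 1] + 0.6 * y[t - 1] + rnorm(1, sd = 0.1)
}

# Each equation of a VAR(1) is an OLS regression on one lag of every series
dat <- data.frame(x = x[-1], y = y[-1], x_lag = x[-n], y_lag = y[-n])
fit_x <- lm(x ~ x_lag + y_lag, data = dat)
fit_y <- lm(y ~ x_lag + y_lag, data = dat)

# One-step-ahead forecast of x from the last observation
pred_x <- predict(fit_x, newdata = data.frame(x_lag = x[n], y_lag = y[n]))
```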

Clustering

Clustering was used to investigate the percentage of COVID-19 new cases in the 25 countries, defined as the ratio of the number of new cases to the population in each country. A matrix of distances between the normalised new-case trajectories of each country was generated; hierarchical clustering then grouped countries with similar trajectories, and four groups were separated for exploration. The hierarchical clustering is shown in Figure 2:

covid_full = covid_data

#Range the countries and date the column
countries = c("Brazil","France", "Germany", "India","Italy", "Spain", "Turkey", "United States", "United Kingdom", "Australia", "Canada", "Singapore","Thailand","Qatar","Netherlands","Belgium","Vietnam","China HongKong","Russia","Switzerland","Japan","South Korea","Saudi Arabia","United Arab Emirates","Israel")
countries <- sort(countries)
covid_full$Date <- as.Date(covid_full$Date)
## selecting the 25 countries and identify time period. 
covid <- covid_full[covid_full$CountryName %in% countries, ]
covid <- covid[ (covid$Date >= "2020-01-26" & covid$Date <= "2022-04-14") , ]

# get all time index of interest 
time_index <- seq(as.Date("2020-01-26"), as.Date('2022-04-14'),'days')

covid_alternative <- NULL # create a new data frame to store result 

for ( i in countries){
  thiscountry <- covid[ covid$CountryName == i , ] 
  thiscountry <- thiscountry[ match(time_index, thiscountry$Date) , ] 
  # ensure the time index is in order of the time index, by matching the two vectors 
  covid_alternative <- rbind(covid_alternative,thiscountry) 
}

# x is first time series, y is second time series
l_p_distance <- function(x, y, p){
    distance = sum((x - y)^p, na.rm = TRUE)^(1/p)
    return(distance)
}

#build matrix of new_cases_percentage for each country
p = 2
#split the column
covid_list = split.data.frame(covid[,c("Date","new_case_percentage")], covid$CountryName)
n = length(countries)
distance_matrix <- matrix(0, n, n)
dateindex = covid_list[[1]]$Date
for (i in 1:n ){
    for (j in 1:n){
          index_i = match(covid_list[[i]]$Date, dateindex)
          index_j = match(covid_list[[j]]$Date, dateindex) 
          ts_i <- covid_list[[i]][index_i,"new_case_percentage"]
          ts_j <- covid_list[[j]][index_j,"new_case_percentage"]
          distance_matrix[i,j] <-  l_p_distance(ts_i, ts_j, p)
    }
}
rownames(distance_matrix) <- colnames(distance_matrix) <- countries
distance_matrix[!is.finite(distance_matrix)] <- 0


matrix_dist <- as.dist(distance_matrix)
#Hierarchical clustering by Ward.D method 
hclust_res <- hclust( matrix_dist, method = "ward.D")  
#hclust_res=color_branches(hclust_res,k=4,col = c(2,3,4,5))
plot(hclust_res,main = 'Figure2: Percentage of COVID19 new cases in 25 countries',ylab = 'Height')
#Separate countries into 4 borders
rect.hclust(hclust_res, k = 4, border =2:5)

Note: Figure 2 shows the percentage of COVID-19 new cases in 25 countries as a dendrogram of the 25 new-case-percentage trajectories grouped into 4 clusters, corresponding to 4 levels: very strong (light blue cluster), strong (dark blue cluster), neutral (green cluster), and weak (red cluster).


The new-case-percentage trajectory tree groups the 25 countries into 4 clusters, corresponding to 4 levels: very strong, strong, neutral, and weak. The largest cluster is characterised by a low new case percentage that remained relatively stable from 26 January 2020 to 14 April 2022. The other clusters have less smooth trajectories than the largest one. The most significant fluctuations, found in the light blue cluster (Israel, Netherlands, Belgium, France, and Switzerland), may be due to the frequent occurrence of significant events. Colour is used to label the new-case-percentage level in each country:

## Figure out the average ranking of the lockdown, mask and WFH in the different Clustering group

red_group <- c('Russia','Turkey','India','Saudi Arabia', 'Japan', 'Thailand','Brazil', 'United Arab Emirates', 'Canada', 'Qatar')
green_group <- c('South Korea', 'China HongKong', 'Vietnam', 'Germany',  'Singapore')
dark_blue <- c("Australia", 'United Kingdom', 'Italy', 'Spain', 'United States')
light_blue <- c('Israel', 'Netherlands', "Belgium", 'France', 'Switzerland')

# Helper: average ranking of a search term across a group of countries
avg_rank <- function(term, group) {
  mean(sapply(group, function(name) df_rank[df_rank$CountryName == name, ][[term]]))
}

# Average ranking of "lockdown", "mask" and "wfh" in each cluster
# (group order: red, dark blue, light blue, green)
groups <- list(red_group, dark_blue, light_blue, green_group)
lockdown <- sapply(groups, function(g) avg_rank("lockdown", g))
mask     <- sapply(groups, function(g) avg_rank("mask", g))
wfh      <- sapply(groups, function(g) avg_rank("wfh", g))

df_cluster <- data.frame(Groups = c('red group', 'dark blue group', 'light blue group', 'green group'),
                         Lockdown = lockdown,
                         Mask = mask,
                         WFH = wfh)
kbl(df_cluster, align = "c",caption = "Figure3: The average ranking of the three popular search terms in the four groups") %>%
  kable_classic() %>%footnote(general = "Figure 3 shows the correlation between the ranking of 3 most popular search terms in the search index and each cluster group. In general, the light blue group has the strongest relationship for each popular word and the red group has the weakest relationship.")
Figure 3: The average ranking of the three popular search terms in the four groups

Groups              Lockdown   Mask   WFH
red group              5.2      5.0   7.7
dark blue group        2.4      3.8   5.0
light blue group       2.8      4.8   7.4
green group            4.8      8.6   3.2

Note: Figure 3 shows the correlation between the ranking of the 3 most popular search terms in the search index and each cluster group. In general, the light blue group has the strongest relationship for each popular word and the red group has the weakest relationship.

Trajectory associations and interplay

We explore the differences between the three search terms across clusters by computing the average ranking of “lockdown”, “mask”, and “WFH” in each cluster, and analyse relevant policy through these popular search terms. Figure 3 shows some correlation between the ranking of the 3 popular search terms and the cluster groups. In the red cluster, countries such as Russia, Turkey, and Thailand rank the popular terms lower, and their new case percentage is the weakest. In clusters with a strong new case percentage, such as the light blue cluster (e.g. Belgium and France), the popular search terms rank relatively high. However, part of the correlation data is not consistent with the figure.

Trajectory similarity

The next step illustrates the similarity of countries within one cluster. Almost all countries in the largest (red) cluster did not impose a lockdown policy. For example, the Russian government had no plans to introduce new coronavirus-related restrictions, and may even cancel some limits owing to the peculiarities of Omicron; Russian citizens may therefore pay little attention to “lockdown” when new cases rise (8).

Result

Evaluation of Vector Auto-regression model

Figure 4 visualizes the performance of the model, judged by the closeness between predictions and actual data. Every sub-plot is obtained by conducting one-step forecasts on the test data iteratively. The blue line is the actual weekly new cases, while the red line consists of consecutive one-step predictions made by the VAR model. The model performs relatively well in most countries except Australia and Thailand. Poor performance in some countries is reasonable, since many factors influence the number of new cases and our model only takes 5 of them into account. For example, new cases in Australia stayed flat for almost 2 years (a period that spans the training set) before the borders opened and restrictions eased; during that time, the relationship between the predictors and new cases differed from the period with open borders, making the prediction model invalid there and resulting in poor performance for Australia.

### Function --> Create the subset data for specific country
create_subset <- function(data, country) {
  model_vars = data %>% 
    dplyr::filter(CountryName==country) %>% 
    dplyr::select(new_cases, COVID_search_index, new_death_weekly, new_test_weekly, positive_rate_weekly)
  return(model_vars)
}

### Function --> Create vector auto-regression (VAR) model 
make_var <- function(model_vars, train = 100) {
  # select the order of the VAR model
  lag = vars::VARselect(model_vars[1:train,])$selection[4][[1]] + 1
  # Create the VAR model using 100 weeks of data with information criterion 'FPE'
  model = vars::VAR(model_vars[1:train,],lag.max = lag, ic="FPE")
  return(model)
}

### Function --> Predict the next week new cases number given --> x: data used for prediction, varest --> VAR model
VAR.pred <- function(x, varest){
 lag = varest$p
 nvars = ncol(varest$y)
 
 # Initialize a coefficient matrix 
 coefMatrix = matrix(NA, nvars, 1+nvars*lag)
 
 # Fill the coefficient matrix using the equation of VAR model
 for(k in 1:nvars) {
   coefMatrix[k, ] = (coef(varest)[[k]])[, 1]
 }

 # Extract the constant 
 cst = as.matrix(coefMatrix[, ncol(coefMatrix)])
 # Exclude the constant column from the coefficient matrix
 M = coefMatrix[, -ncol(coefMatrix)]
 # Initialize the matrix for prediction data
 prediction = matrix(NA, 1, nvars)
 # subset the data for only the weeks within the lag
 x_subset = as.matrix(x[nrow(x):(nrow(x)- lag + 1), ])
 # Apply the equation to data where lag = 1
 nextWeek = M[, 1:nvars]%*%t(x_subset)[, 1]
 # Apply the equation to data where lag > 1 (skipped when lag == 1)
 if (lag > 1) {
   for(l in 2:lag) {
     nextWeek = nextWeek + M[, (1 + nvars*(l-1)):(nvars*l)]%*%t(x_subset)[, l]
   }
 }
 # Apply the constants
 nextWeek = nextWeek + cst
 # Transpose the result and fill it into the prediction matrix
 prediction[1, ] = t(nextWeek) 

 result = data.frame(prediction)
 names(result) = dimnames(x)[[2]]
 return(result)
}

### Function --> Test the VAR model using test data from test start to test end
test_var <- function(model, data, test_start, test_end) {
  predictions = c()
  truevals=c()
  # Iteratively conduct one-step forecasts
  for (i in test_start:test_end) {
    pred = VAR.pred(x = data[1:i,] ,varest = model)[1][[1]]
    trueval = data[i+1,1][[1]]
    truevals = c(truevals, trueval)
    predictions = c(predictions, pred)
  }
  return(data.frame(prediction = predictions, actual = truevals))
}

### Function (For Shiny App) --> Output the predicted new cases number (next week) and trend for a specific country and date 
prediction_result <- function(date, country, covid_data) {
  date_filter = covid_data %>% dplyr::filter(CountryName==country) %>% dplyr::select(Date)
  dates = as.character(date_filter$Date)
  this_week = which(dates == date)
  target_data = create_subset(data = covid_data,country = country)
  model = make_var(target_data, train = 100)
  direction = ""
  next_week_case = VAR.pred(x = target_data[1:this_week,], varest = model)[,1]
  if (next_week_case > target_data$new_cases[this_week]) {
    direction = "increase"
  }
  if (next_week_case < target_data$new_cases[this_week]) {
    direction = "decrease"
  }
  if (next_week_case == target_data$new_cases[this_week]) {
    direction = "level"
  }
  return(data.frame(new_case_next_week = next_week_case, direction = direction))
}

### Function --> Draw the line plot for predicted and actual new cases and return it
draw_compare_ret <- function(comparison_data) {
  comparison_data = cbind(week = 1:nrow(comparison_data), comparison_data)
  d <- melt(comparison_data, id.vars=c("week"))
  plot = ggplot(d, aes(x=week, y=value, col=variable)) + geom_point() + geom_line() + 
    xlab("Week")+ ylab("New Cases") + ggtitle(stringr::str_glue("{country}")) + 
    theme(text = element_text(size = 10), 
          legend.position = "bottom",
          legend.text = element_text(size = 10),
          plot.title = element_text(family = "serif", face = "bold", size = 10, hjust = 0.5, 
                                    vjust = 2, angle = 0, lineheight = 20, margin = margin(20, 0, 0, 0)), 
          plot.caption = element_text(hjust = 0.5, colour = "brown4"), plot.caption.position = "panel")
  return(plot)
}

### Function --> Return a data frame that store the predicted and actual trend of the new cases for the testing data
test_direction <- function(model, data, test_start, test_end) {
  predictions = c()
  truedirs=c()

  for (i in test_start:test_end) {
    pred = VAR.pred(x = data[1:i,] ,varest = model)[1][[1]]
    this_week = data[i,1][[1]]
    trueval = data[i+1,1][[1]]
    truedir = ""
    pred_dir = ""
    if (trueval > this_week) { truedir = "increase"}
    if (trueval < this_week) { truedir = "decrease"}
    if (trueval == this_week) {truedir = "level"}
    if (pred > this_week) {pred_dir = "increase"}
    if (pred < this_week) { pred_dir = "decrease"}
    if (pred == this_week) {pred_dir = "level"}
    truedirs = c(truedirs, truedir)
    predictions = c(predictions, pred_dir)
  }
  return(data.frame(prediction = predictions, actual = truedirs))
}

# Filter out the country name
distinct_countries = c(covid_data[,1] %>% dplyr::distinct(CountryName))
n_countries = lengths(distinct_countries)
trend_accuracy = c()
MAE_all = c()
RMSE_all = c()
MAPE_all = c()

for (i in 1:n_countries) {
  country = distinct_countries$CountryName[[i]]
  ## Exclude the countries that do not have sufficient data to make prediction
  if (country != "Brazil" && country != "Qatar" && country != "Singapore" && country != "Vietnam" && country != "World") {
  
  # VAR model for each country
  model = make_var(create_subset(data = covid_data,country = country), train = 100)
  
  # result --> Actual new cases, predicted new cases
  result = test_var(model, data = create_subset(data = covid_data,country = country), test_start = 100, test_end=115)
  
  # Calculate the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
  MAE = Metrics::mae(actual = result$actual, predicted = result$prediction)
  RMSE = Metrics::rmse(actual = result$actual, predicted = result$prediction)
  MAPE = MLmetrics::MAPE(result$prediction, result$actual)
  MAE_all = append(MAE_all, round(MAE, digits = 2))
  RMSE_all = append(RMSE_all, round(RMSE, digits = 2))
  MAPE_all = append(MAPE_all, MAPE)
  MAPE_all = append(MAPE_all, MAPE)
  
  
  # direction result --> Actual trend for new cases, predicted trend for new cases
  direction_result = test_direction(model, data = create_subset(data = covid_data,country = country), test_start = 100, test_end=115)
  trend_accuracy = c(trend_accuracy, round(mean(direction_result$prediction == direction_result$actual), digit = 2)*100)
  }
}
## Output the line plot for predicted and actual new cases for every country
myplots <- list()
j = 1
for (i in 1:n_countries) {
  country = distinct_countries$CountryName[[i]]
  if (country != "Brazil" && country != "Qatar" && country != "Singapore" && country != "Vietnam" && country != "World") {
    # Apply function to get the prediction result and graph
    result = test_var(make_var(create_subset(data = covid_data,country = country), train = 100), data = create_subset(data = covid_data,country = country), test_start = 100, test_end=115)
    myplots[[j]] = draw_compare_ret(result)
    j = j + 1
  }
}

# Combine the graphs for different countries into one figure and added title,caption to it
figure2 <- ggpubr::ggarrange(plotlist = myplots, ncol=3, nrow=7, common.legend = TRUE, legend="bottom")
title <- expression(atop(bold("Actual vs predicted new cases in different countries"), scriptstyle("Figure 4: Forecasting results of the VAR model, visually comparing the actual and predicted new cases in each country over the testing data (16 weeks)")))
ggpubr::annotate_figure(figure2,top= ggpubr::text_grob(title, size = 19))

Since every forecasting model carries error, the efficiency of the forecasts for decision-making needs to be compared across countries. Therefore, the evaluation metrics MAPE, MAE, and RMSE were calculated to assess the VAR model's performance. MAPE is commonly used to measure the accuracy of a forecasting system; the smaller the MAPE, the better the forecast (9). Similarly, a smaller MAE indicates a smaller average error, and a lower RMSE indicates a smaller average prediction error, so a country with lower MAE and RMSE has forecasts closer to reality than other countries. For example, from Figure 5, the MAE and RMSE are lower in the United Arab Emirates (AE) than in other countries, which means the model's prediction error is smallest there. Moreover, the MAPE of AE is 16%, meaning the forecasted value differs from the actual value by 16% on average. Thus, the VAR model performs reasonably well in AE.
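To make the three metric definitions concrete, the following toy illustration (invented numbers, not the project data) shows how MAE, RMSE, and MAPE are computed from actual and predicted new-case counts:

```r
# Toy illustration (invented numbers, not the project data):
# how MAE, RMSE, and MAPE are computed from actual vs predicted new cases
actual    <- c(100, 120, 90, 110)
predicted <- c(110, 115, 80, 100)

mae  <- mean(abs(actual - predicted))            # Mean Absolute Error
rmse <- sqrt(mean((actual - predicted)^2))       # Root Mean Squared Error
mape <- mean(abs((actual - predicted) / actual)) # Mean Absolute Percentage Error

round(mae, 2)         # 8.75
round(rmse, 2)        # 9.01
round(mape * 100, 2)  # 8.59 (%)
```

The `Metrics::mae`, `Metrics::rmse`, and `MLmetrics::MAPE` calls used in the model code implement exactly these formulas.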

Countries <- c("Australia", "Belgium", "Canada", "France", "Germany", "HongKong(China)", "India", "Israel", "Italy","Japan", "Netherlands", "Russia", "Saudi Arabia", "South Korea", "Spain", "Switzerland", "Thailand", "Turkey", "United Arab Emirates", "United Kingdom", "United States")

mean_trend_accuracy = mean(trend_accuracy) # ~74.11 (%)

# Put all evaluation values into a data frame
var_evaluate_df <- data.frame(Countries, MAPE_all, MAE_all, RMSE_all, trend_accuracy) 

# Temporarily convert infinite MAPE values to 0 so rounding is well-behaved
var_evaluate_df$MAPE_all[!is.finite(var_evaluate_df$MAPE_all)] <- 0
var_evaluate_df$MAPE_all <- round(var_evaluate_df$MAPE_all, digits = 2) * 100

# Convert 0 back to infinity to keep the data valid and authentic
var_evaluate_df$MAPE_all[var_evaluate_df$MAPE_all == 0] <- Inf

colnames(var_evaluate_df) <- c('Country', 'MAPE (%)', 'MAE', 'RMSE', 'Trend Accuracy (%)')

# Summaries all the information into a table
DT::datatable(var_evaluate_df, caption = HTML('Figure 5: Summary of evaluation metrics of the VAR model <br/> Note: This table uses MAPE, MAE, RMSE, and Trend Accuracy as standard metrics to evaluate the VAR model for each country.'
))

Predicting the exact number of new cases is a complex problem, as it is influenced by a multitude of factors and considerable uncertainty. Alternatively, the VAR model can be used to predict the trend of new cases (increase or decrease) for the next week. From Figure 5, 9 countries have an accuracy of over 80% in predicting the trend of new cases, and the overall accuracy is 74%. Overall, the VAR model performs well in many countries, and its performance is robust and generalisable. When up-to-date datasets are imported in the future, the VAR model should continue to work and may even improve in accuracy. Moreover, while the model was built with media workers in mind, it can also serve other audiences.
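Trend accuracy, as reported in Figure 5, is simply the share of test weeks where the predicted direction of change matches the actual direction. A toy sketch with invented numbers:

```r
# Toy sketch (invented numbers): trend accuracy is the share of weeks where
# the predicted direction of change (up/down vs the previous week) matches
# the actual direction
actual    <- c(100, 120, 110, 130, 125)
predicted <- c( 98, 115, 118, 128, 120)

actual_dir    <- sign(diff(actual))     #  1 -1  1 -1
predicted_dir <- sign(diff(predicted))  #  1  1  1 -1
mean(actual_dir == predicted_dir)       # 0.75, i.e. 75% trend accuracy
```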

Shiny App

Based on the above research, a Shiny app for media workers was designed. Its purpose is to provide a more convenient creative environment for media workers who produce COVID-related content; all models and functions in the app are based on real-world data and national policies. On the home page, the interactive globe on the right shows the top three search terms of different countries during the pandemic, and users can hover over each country/location to view them. The lower part of the page contains a scatter plot of new cases in various countries, drawn with the dygraphs package, which lets users view new cases and major pandemic-related events in different countries and periods; on the right is a brief introduction to each event with related links. On the cluster page, users can select a country to observe new cases as a percentage of the country's total population, and can find the policy details corresponding to a keyword by selecting search terms. For example, selecting 'Japan' and 'Lockdown' shows the colour group of Japan and Japan's lockdown policy. On the model page, users can see the predicted new cases and trend for the next week by selecting a country and date. All data used by the Shiny app are displayed on the data page, where users can filter and view the data they are interested in. Finally, the help page explains the usage and data sources of the app. In general, the app combines interactive visualisation, cluster analysis, and prediction modelling to analyse COVID and search-term data in depth and to assist users in creating COVID-related content.
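As a rough sketch of the page layout described above (page names only; the actual app and its server logic live in the project's Shiny code, so everything here is illustrative):

```r
# Illustrative skeleton only -- page names follow the description above;
# the real app implements the globe, dygraph, cluster, model, and data views
library(shiny)

ui <- navbarPage(
  "COVID-19 Search Index",
  tabPanel("Home"),     # interactive globe + dygraph of new cases and big events
  tabPanel("Cluster"),  # colour groups and policies by country and search term
  tabPanel("Model"),    # next-week prediction of new cases and trend
  tabPanel("Data"),     # filterable tables of all data used by the app
  tabPanel("Help")      # usage instructions and data sources
)

server <- function(input, output, session) {}

# shinyApp(ui, server)  # uncomment to launch
```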

Discussion and Conclusion

There are some potential shortcomings in this project. For example, there may not be enough variables from different aspects to support our predictions: we only considered the five most important influencing factors, which may introduce selection bias. This could be addressed by further analysis that adds more COVID-19-related indexes and thereby improves the accuracy of the prediction model. For individual countries, the prediction of COVID-19 could also benefit from country-specific considerations, for example taking into account the uncertainty of vaccine policy in more or less developed regions. In addition, in some countries whose mother tongue is not English, such as Japan and Thailand, the most popular search terms are in the native language; since our research used English terms internationally, this may explain why the prediction models of a few countries are inaccurate.

The empirical results show that the Google Search index can effectively provide helpful information and knowledge for journalists in the context of the COVID-19 pandemic, and that, by establishing a VAR model, it can also predict the trend and number of new cases in different countries for the coming week. The research also quantifies the accuracy of those predictions on COVID-19-related data.

Moreover, more time-series data would enable real-time prediction (10), which matters especially for journalists who must report in real time, since the reliability and timeliness of news are paramount (11). Therefore, collecting more data about the COVID-19 pandemic and accessing a real-time interface are needed in the future. A second point is improving the data integrity of non-English-speaking countries and finding more relevant policies to explain the differences between countries; synthesising these data would provide better material for journalists. In addition, more new COVID-19 data would help to build and evaluate a more accurate prediction model. Finally, adding data from more countries in the future would expand the media's audience and provide more COVID-19-related news to media in different countries.

Contribution

  • Cheng Zeng: Data collection and cleaning (6 Python files and 1 ipynb file); wrote the correlation and rolling-window code; responsible for GitHub and README; participated in the Summary, Background, and Conclusion of the report

  • Enqi Liu: Proposed the overall topic; wrote code for the lag-time analysis, prediction model, and evaluation, and wrote the corresponding report sections; formatted and proofread the report

  • Xinyi Yan: Visualisation for clustering; exploration of policies and big events; the clustering and evaluation sections of the report

  • Yang Hu: The clustering algorithm and Figure 2; policy and big-event references; report sections supplementing clustering and evaluation generalisation

  • Xiao Zhang: Finding and generating data and correlation information; wrote the discussion and summary of the report

  • Hongyuan Guo: Shiny App, Conclusion and result of Shiny App in the report

Reference

  1. Archived: WHO Timeline - COVID-19 [Internet]. Who.int. 2020 [cited 30 May 2022]. Available from: https://www.who.int/news/item/27-04-2020-who-timeline---covid-19

  2. Haroon O, Rizvi S. COVID-19: Media coverage and financial markets behavior—A sectoral inquiry. Journal of Behavioral and Experimental Finance. 2020;27:100343. https://doi.org/10.1016/j.jbef.2020.100343

  3. Young M, King N, Harper S, Humphreys K. The influence of popular media on perceptions of personal and population risk in possible disease outbreaks. Health, Risk & Society. 2013;15(1):103-114. https://doi.org/10.1080/13698575.2012.748884

  4. Matta G. Science communication as a preventative tool in the COVID19 pandemic. Humanities and Social Sciences Communications. 2020;7(1). https://doi.org/10.1057/s41599-020-00645-1

  5. Most Powerful Countries 2022 [Internet]. Worldpopulationreview.com. 2022 [cited 31 May 2022]. Available from: https://worldpopulationreview.com/country-rankings/most-powerful-countries

  6. Hilbrich M, Müller-Pfefferkorn R. Cross-Correlation as Tool to Determine the Similarity of Series of Measurements for Big-Data Analysis Tasks. Cloud Computing and Big Data. 2016;:263-282. https://doi.org/10.1007/978-3-319-28430-9_20

  7. Liew, Venus. (2004). Which Lag Selection Criteria Should We Employ?. Economics Bulletin. 3. 1-9.

  8. No plans of lockdown in Russia over coronavirus — Putin [Internet]. Tass.com. 2022 [cited 1 June 2022]. Available from: https://tass.com/society/1397555?utm_source=google.com&utm_medium=organic&utm_campaign=google.com&utm_referrer=google.com

  9. Mean Absolute Percentage Error (MAPE). In: Swamidass PM, editor. Encyclopedia of Production and Manufacturing Management. Springer, Boston, MA; 2000. https://doi.org/10.1007/1-4020-0612-8_580

  10. Shahriari S, Hossein Rashidi T, Azad A, Vafaee F. COVIDSpread: real-time prediction of COVID-19 spread based on time-series modelling. F1000Research. 2021;10:1110. https://doi.org/10.12688/f1000research.73969.1

  11. Drok N. BEACONS OF RELIABILITY. Journalism Practice. 2013;7(2):145-162. https://doi-org.ezproxy.library.sydney.edu.au/10.1080/17512786.2012.753209

Appendix

  • Session Info
sessionInfo()
## R version 4.2.0 (2022-04-22)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.23             htmltools_0.5.2     knitr_1.39         
##  [4] latticeExtra_0.6-29 lattice_0.20-45     pheatmap_1.0.12    
##  [7] factoextra_1.0.7    hablar_0.3.0        ggthemes_4.2.4     
## [10] reshape2_1.4.4      janitor_2.1.0       readxl_1.4.0       
## [13] kableExtra_1.3.4    gridExtra_2.3       forcats_0.5.1      
## [16] stringr_1.4.0       purrr_0.3.4         readr_2.1.2        
## [19] tidyr_1.2.0         tibble_3.1.7        tidyverse_1.3.1    
## [22] plotly_4.10.0       dplyr_1.0.9         ggplot2_3.3.6      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-3   ggsignif_0.6.3     ellipsis_0.3.2     snakecase_0.11.0  
##  [5] fs_1.5.2           rstudioapi_0.13    ggpubr_0.4.0       farver_2.1.0      
##  [9] ggrepel_0.9.1      bit64_4.0.5        fansi_1.0.3        lubridate_1.8.0   
## [13] xml2_1.3.3         jsonlite_1.8.0     Metrics_0.1.4      broom_0.8.0       
## [17] dbplyr_2.1.1       png_0.1-7          compiler_4.2.0     httr_1.4.3        
## [21] backports_1.4.1    assertthat_0.2.1   fastmap_1.1.0      lazyeval_0.2.2    
## [25] strucchange_1.5-2  cli_3.3.0          tools_4.2.0        gtable_0.3.0      
## [29] glue_1.6.2         Rcpp_1.0.8.3       carData_3.0-5      cellranger_1.1.0  
## [33] jquerylib_0.1.4    vctrs_0.4.1        urca_1.3-0         svglite_2.1.0     
## [37] nlme_3.1-157       vars_1.5-6         crosstalk_1.2.0    lmtest_0.9-40     
## [41] xfun_0.31          rvest_1.0.2        lifecycle_1.0.1    rstatix_0.7.0     
## [45] MASS_7.3-56        zoo_1.8-10         MLmetrics_1.1.1    scales_1.2.0      
## [49] vroom_1.5.7        hms_1.1.1          parallel_4.2.0     sandwich_3.0-1    
## [53] RColorBrewer_1.1-3 yaml_2.3.5         sass_0.4.1         stringi_1.7.6     
## [57] highr_0.9          rlang_1.0.2        pkgconfig_2.0.3    systemfonts_1.0.4 
## [61] evaluate_0.15      htmlwidgets_1.5.4  labeling_0.4.2     cowplot_1.1.1     
## [65] bit_4.0.4          tidyselect_1.1.2   plyr_1.8.7         magrittr_2.0.3    
## [69] bookdown_0.26      R6_2.5.1           generics_0.1.2     DBI_1.1.2         
## [73] pillar_1.7.0       haven_2.5.0        withr_2.5.0        abind_1.4-5       
## [77] modelr_0.1.8       crayon_1.5.1       car_3.0-13         utf8_1.2.2        
## [81] tzdb_0.3.0         rmarkdown_2.14     jpeg_0.1-9         grid_4.2.0        
## [85] data.table_1.14.2  rmdformats_1.0.4   reprex_2.0.1       digest_0.6.29     
## [89] webshot_0.5.3      munsell_0.5.0      viridisLite_0.4.0  bslib_0.3.1
## Find the top 4 hot search terms in each country and the world
top_4 <- matrix(ncol = 5, nrow = 26)
i = 1
for (name in country_name){
  country <- c(name)
  ranks = df_rank[df_rank$CountryName == name, ][2:14]
  for (colname in rank_colnames){
    # Collect the four terms ranked 1-4 (each column holds exactly one rank)
    if (ranks[[colname]] %in% 1:4){
      country <- append(country, colname)
    }
  }
  top_4[i,] <- country
  i = i + 1
}
top_4_df <- data.frame(top_4)
colnames(top_4_df) <- c('CountryName', 'First Rank', 'Second Rank', 'Third Rank', 'Fourth Rank')
kable(top_4_df, align = 'c')
| CountryName | First Rank | Second Rank | Third Rank | Fourth Rank |
|:---:|:---:|:---:|:---:|:---:|
| Australia | covid | lockdown | mask | wfh |
| Belgium | covid | lockdown | mask | socialdistance |
| Brazil | covid | education | export | vaccine |
| Canada | covid | lockdown | mask | vaccine |
| China HongKong | covid | socialdistance | vaccine | wfh |
| France | covid | lockdown | socialdistance | vaccine |
| Germany | covid | export | lockdown | wfh |
| India | covid | lockdown | mask | wfh |
| Israel | covid | lockdown | mask | vaccine |
| Italy | covid | education | lockdown | mask |
| Japan | covid | export | mask | vaccine |
| Netherlands | covid | lockdown | mask | vaccine |
| Qatar | covid | medicaltreatment | travel | vaccine |
| Russia | covid | immigration | lockdown | vaccine |
| Saudi Arabia | covid | education | lockdown | vaccine |
| Singapore | covid | education | flight | wfh |
| South Korea | covid | education | export | socialdistance |
| Spain | covid | education | lockdown | mask |
| Switzerland | covid | lockdown | mask | vaccine |
| Thailand | covid | mask | socialdistance | wfh |
| Turkey | covid | education | export | vaccine |
| United Arab Emirates | covid | education | mask | vaccine |
| United Kingdom | covid | lockdown | mask | wfh |
| United States | covid | education | lockdown | vaccine |
| Vietnam | covid | lockdown | mask | wfh |
| World | covid | lockdown | mask | wfh |
# create the pheatmap
clustering <- pheatmap(distance_matrix, 
                 cluster_cols = T,
                 cluster_rows = T,
                 main = "L^2 distance", 
                 clustering_method = "ward.D")
clustering

# Run over range of p values
p_vector <- c( 2, 6, 10)
covid_list = split.data.frame(covid[,c("Date","new_case_percentage")], covid$CountryName)

LPdistance = function(p)
{
  n = length(countries)
  distance_matrix <- matrix(0, n, n)
  dateindex = covid_list[[1]]$Date
  for (i in 1:n ){
      for (j in 1:n){
            index_i = match(covid_list[[i]]$Date, dateindex)
            index_j = match(covid_list[[j]]$Date, dateindex) 
            ts_i <- covid_list[[i]][index_i,"new_case_percentage"]
            ts_j <- covid_list[[j]][index_j,"new_case_percentage"]
            distance_matrix[i,j] <-  l_p_distance(ts_i, ts_j, p)
      }
  }
  rownames(distance_matrix) <- colnames(distance_matrix) <- countries
  distance_matrix[!is.finite(distance_matrix)] <- 0  # replace non-finite distances (arguably these should be NA)
  return(distance_matrix)
}
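The `l_p_distance()` helper used above is defined elsewhere in the project; a hypothetical minimal version consistent with its use here might look like:

```r
# Hypothetical minimal version of the l_p_distance() helper (the project
# defines it elsewhere): the L^p distance between two numeric series
l_p_distance <- function(x, y, p) {
  x <- as.numeric(unlist(x))  # accept data-frame columns as well as vectors
  y <- as.numeric(unlist(y))
  sum(abs(x - y)^p)^(1 / p)
}

l_p_distance(c(0, 3), c(4, 7), 2)  # Euclidean case: sqrt(4^2 + 4^2) ~ 5.66
```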
 
## Calculate distance matrix for different p
distmatrix_list = lapply(p_vector, LPdistance)
names(distmatrix_list) = paste("p = ", p_vector)
  
## Select the matrix to visualize
clustering_list = list()
for(i in 1: length(distmatrix_list))
{
  clustering_list[[i]] <- pheatmap(distmatrix_list[[i]], 
                         cluster_cols = T,
                         main = names(distmatrix_list)[i],
                         clustering_method = "ward.D")
}

# Circular barplot part
# Data cleaning
# Filter out the World rows
Final = covid_data[which(covid_data$CountryName != "World"),]
# Group by CountryName, then summarise the mean of new_case_percentage
data = Final %>% group_by(CountryName) %>% summarise(percent = mean(new_case_percentage)) 

#Sort countries into different color groups
#Final = read.csv("Percent_COVID_Index.csv")
Final = Final[which(Final$CountryName != "World"),]
#Classify the countries into color clusters
Final = Final %>%
  mutate(group = case_when(
    CountryName == "Australia" ~ "darkblue",
    CountryName == "United Kingdom" ~ "darkblue",
    CountryName == "Italy" ~ "darkblue",
    CountryName == "Spain" ~ "darkblue",
    CountryName == "United States" ~ "darkblue",
    CountryName == "Israel" ~ "lightblue",
    CountryName == "Netherlands" ~ "lightblue",
    CountryName == "Belgium" ~ "lightblue",
    CountryName == "France" ~ "lightblue",
    CountryName == "Switzerland" ~ "lightblue",
    CountryName == "South Korea" ~ "green",
    CountryName == "Hong Kong" ~ "green",
    CountryName == "Vietnam" ~ "green",
    CountryName == "Germany" ~ "green",
    CountryName == "Singapore" ~ "green",
    CountryName == "Russia" ~ "red",
    CountryName == "Turkey" ~ "red",
    CountryName == "Saudi Arabia" ~ "red",
    CountryName == "India" ~ "red",
    CountryName == "Japan" ~ "red",
    CountryName == "Thailand" ~ "red",
    CountryName == "Brazil" ~ "red",
    CountryName == "United Arab Emirates" ~ "red",
    CountryName == "Canada" ~ "red",
    CountryName == "Qatar" ~ "red"))  # "World" was already filtered out above
#groupby data into countryname and color group.
data = Final %>% group_by(CountryName, group) %>% summarise(percent = mean(new_case_percentage)) 
## `summarise()` has grouped output by 'CountryName'. You can override using the
## `.groups` argument.
# Arrange the data by group
data <- data.frame(data %>% arrange(group))
# Summarise the mean of each group
data %>% filter(group == "lightblue") %>% summarise(mean(percent))
data %>% filter(group == "darkblue") %>% summarise(mean(percent))
data %>% filter(group == "green") %>% summarise(mean(percent))
data %>% filter(group == "red") %>% summarise(mean(percent))
#create circular barplot of different countries
# Set a number of 'empty bar' to add at the end of each group
empty_bar <- 4
data$group <- as.factor(data$group)  # case_when() returns character; nlevels()/levels() need a factor
to_add <- data.frame(matrix(NA, empty_bar*nlevels(data$group), ncol(data)) )
colnames(to_add) <- colnames(data)
to_add$group <- rep(levels(data$group), each=empty_bar)
data <- rbind(data, to_add)
data <- data %>% arrange(group)
data$id <- seq(1, nrow(data))

# Get the name and the y position of each label
label_data <- data
number_of_bar <- nrow(label_data)
angle <- 90 - 360 * (label_data$id-0.5) /number_of_bar     # subtract 0.5 so each label sits at the centre of its bar, not at its edge
label_data$hjust <- ifelse( angle < -90, 1, 0)
label_data$angle <- ifelse(angle < -90, angle+180, angle)
 
# Make the plot

data$group_factor <- factor(data$group,levels = c("darkblue","green","lightblue","red"),labels = c("darkblue","green","lightblue","red"))
p <- ggplot(data, aes(x=as.factor(id), y=percent)) +  # id is a factor; a numeric x would leave a gap before the first bar
  geom_bar(stat="identity", alpha=1, fill=as.character(data$group_factor)) +
  ggtitle("New case percentage of each group") +
  ylim(-5,8) +
  theme_minimal() +
  theme(
    legend.position = "none",
    axis.text = element_blank(),
    axis.title = element_blank(),
    panel.grid = element_blank(),
    plot.margin = unit(rep(-1,4), "cm") 
  ) +
  coord_polar() + 
  geom_text(data=label_data, aes(x=id, y=percent+1, label=CountryName, hjust=hjust), 
            color="black", fontface="bold", alpha=1, size=2.5, angle= label_data$angle, inherit.aes = FALSE) 
p

# Order data
# Create a 3D bar plot to show the data
library(latticeExtra)
# Create the data frame of cluster rankings
rank = data.frame(
  policy = as.factor(rep(c("lockdown","mask","WFH"), 4)),
  label_color = as.factor(rep(c("Red","Dark blue","Light blue","Green"), 3)),
  value = c(5.2, 3.8, 7.4, 4.8, 5.0, 5.0, 2.8, 8.6, 7.7, 2.4, 4.8, 3.2))  # as.integer() would truncate these decimals

# Set the colour of each bar (valid R colour names, ordered to match the
# alphabetical factor levels of label_color)
# library(RColorBrewer)
# mycolors <- brewer.pal(3, "Blues")
mycolors <- c("darkblue", "green", "lightblue", "red")
# Create the plot
cloud(value~label_color+policy,rank, panel.3d.cloud=panel.3dbars,col.facet=mycolors,
      xbase=0.4, ybase=0.4, scales=list(arrows=FALSE, col="brown"), 
      par.settings = list(axis.line = list(col = "transparent")))